
    Software trace cache

    We explore the use of compiler optimizations that optimize the layout of instructions in memory. The goal is to enable the code to make better use of the underlying hardware resources, regardless of the specific details of the processor/architecture, in order to increase fetch performance. The Software Trace Cache (STC) is a code layout algorithm with a broader target than previous layout optimizations: we aim not only to improve the instruction cache hit rate, but also to increase the effective fetch width of the fetch engine. The STC algorithm organizes basic blocks into chains, trying to make sequentially executed basic blocks reside in consecutive memory positions, and then maps the basic block chains in memory to minimize conflict misses in the important sections of the program. We evaluate and analyze in detail the impact of the STC, and of code layout optimizations in general, on the three main aspects of fetch performance: the instruction cache hit rate, the effective fetch width, and the branch prediction accuracy. Our results show that layout-optimized codes have special characteristics that make them more amenable to high-performance instruction fetch: they have a very high rate of not-taken branches and execute long chains of sequential instructions; they also make very effective use of instruction cache lines, mapping only useful instructions that will execute close in time, increasing both spatial and temporal locality.
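
    As an illustration of the chain-building idea (a minimal Python sketch, not the actual STC algorithm; the control-flow profile below is hypothetical), frequently executed fall-through edges can be used to place sequentially executed basic blocks next to each other in memory:

    def build_chains(edges):
        """edges: list of (src_block, dst_block, exec_count) from a profile."""
        chain_of = {}          # block -> chain (list) it currently belongs to
        chains = []
        for src, dst, _ in sorted(edges, key=lambda e: -e[2]):   # heaviest edges first
            c_src = chain_of.get(src)
            c_dst = chain_of.get(dst)
            if c_src is None:
                c_src = [src]; chains.append(c_src); chain_of[src] = c_src
            if c_dst is None:
                c_dst = [dst]; chains.append(c_dst); chain_of[dst] = c_dst
            # Merge only if src ends its chain and dst starts its chain, so the
            # hot fall-through path becomes a straight run of consecutive blocks.
            if c_src is not c_dst and c_src[-1] == src and c_dst[0] == dst:
                c_src.extend(c_dst)
                for b in c_dst:
                    chain_of[b] = c_src
                chains = [c for c in chains if c is not c_dst]
        return chains

    # Hypothetical profile: (source block, destination block, execution count)
    profile = [("A", "B", 900), ("B", "C", 850), ("A", "D", 100), ("D", "C", 100)]
    print(build_chains(profile))   # -> [['A', 'B', 'C'], ['D']]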

    Instruction fetch architectures and code layout optimizations

    The design of higher performance processors has been following two major trends: increasing the pipeline depth to allow faster clock rates, and widening the pipeline to allow parallel execution of more instructions. Designing a higher performance processor implies balancing all the pipeline stages to ensure that overall performance is not dominated by any one of them. This means that a faster execution engine also requires a faster fetch engine, to ensure that it is possible to read and decode enough instructions to keep the pipeline full and the functional units busy. This paper explores the challenges faced by the instruction fetch stage for a variety of processor designs, from early pipelined processors to the more aggressive wide-issue superscalars. We describe the different fetch engines proposed in the literature, the performance issues involved, and some of the proposed improvements. We also show how compiler techniques that optimize the layout of the code in memory can be used to improve the fetch performance of the different engines described. Overall, we show how instruction fetch has evolved from fetching one instruction every few cycles, to fetching one instruction per cycle, to fetching a full basic block per cycle, and on to several basic blocks per cycle, tracing the evolution of the mechanisms surrounding the instruction cache and the different compiler optimizations used to better exploit them.
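
    As a toy illustration of why effective fetch width matters (a sketch under assumed parameters, not any specific engine from the survey), a purely sequential fetch engine stops at a cache-line boundary or at the first taken branch, so the width it actually delivers is usually well below its nominal width:

    LINE_SIZE = 8   # instructions per cache line (assumed)

    def fetched_per_cycle(pc, width, taken_branches):
        """Instructions delivered this cycle by a sequential fetch engine starting at `pc`."""
        line_end = (pc // LINE_SIZE + 1) * LINE_SIZE
        count = 0
        while count < width and pc < line_end:
            count += 1
            if pc in taken_branches:   # a taken branch ends the fetch block
                break
            pc += 1
        return count

    # Hypothetical: a taken branch every few instructions limits a 16-wide fetch.
    print(fetched_per_cycle(pc=0, width=16, taken_branches={4}))    # -> 5
    print(fetched_per_cycle(pc=8, width=16, taken_branches=set()))  # -> 8 (line boundary)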

    Enlarging instruction streams

    The stream fetch engine is a high-performance fetch architecture based on the concept of an instruction stream. We call a sequence of instructions from the target of a taken branch to the next taken branch, potentially containing multiple basic blocks, a stream. The long length of instruction streams makes it possible for the stream fetch engine to provide a high fetch bandwidth and to hide the branch predictor access latency, leading to performance results close to a trace cache at a lower implementation cost and complexity. Therefore, enlarging instruction streams is an excellent way to improve the stream fetch engine. In this paper, we present several hardware and software mechanisms focused on enlarging those streams that terminate at particular branch types. However, our results point out that focusing on particular branch types is not a good strategy, due to Amdahl's law. Consequently, we propose the multiple-stream predictor, a novel mechanism that deals with all branch types by combining single streams into long virtual streams. This proposal tolerates the prediction table access latency without requiring the complexity caused by additional hardware mechanisms such as prediction overriding. Moreover, it provides performance comparable to state-of-the-art fetch architectures with a simpler design that consumes less energy.
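
    As an illustration of the stream definition above (a minimal sketch; the trace format is an assumption made only for this example), streams can be carved out of a dynamic instruction trace by cutting at every taken branch:

    def split_into_streams(trace):
        """trace: iterable of (pc, is_taken_branch) in dynamic execution order."""
        streams, current = [], []
        for pc, taken in trace:
            current.append(pc)
            if taken:                    # a taken branch terminates the stream;
                streams.append(current)  # the next instruction is its target
                current = []
        if current:
            streams.append(current)
        return streams

    # Hypothetical trace: instruction addresses with taken-branch markers.
    trace = [(0, False), (1, False), (2, True),                     # stream of length 3
             (40, False), (41, False), (42, False), (43, True)]     # stream of length 4
    for s in split_into_streams(trace):
        print(len(s), s)   # longer streams -> more instructions fetched per prediction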

    Pressure Bifurcation Phenomenon on Supersonic Blowing Trailing Edges

    Turbine blades operating in the transonic-supersonic regime develop a complex shock-wave system at the trailing edge, a phenomenon that leads to unfavorable pressure perturbations downstream and can interact with other turbine stages. Understanding the fluid behavior in the area adjacent to the trailing edge is essential in order to determine the parameters that influence these pressure fluctuations. Colder flow, bled from the high-pressure compressor, is often purged at the trailing edge to cool the thin blade edges, affecting the flow behavior and modulating the intensity and angle of the shock-wave system. However, this purge flow can sometimes generate non-symmetrical configurations due to a pressure difference provoked by the injected flow. In this work, a combination of RANS simulations and global stability analysis is employed to explain the physical origin of this flow bifurcation. By analyzing the features that naturally appear in the flow and become dominant for certain values of the problem parameters, an anti-symmetric global mode, related to the sudden geometrical expansion of the trailing-edge slot, is identified as the main mechanism that forces the changes in the flow topology.
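
    For context, a global linear stability analysis of a steady base flow typically reduces to a generalized eigenvalue problem of the following generic form; this is a textbook sketch written in LaTeX, not the exact formulation used in the paper:

    % Generic sketch (not the paper's exact equations) of a global linear
    % stability analysis about a steady RANS base flow \bar{q}(x, y):
    q(x, y, t) = \bar{q}(x, y) + \epsilon \, \hat{q}(x, y) \, e^{-i \omega t},
        \qquad \epsilon \ll 1 .
    % Substituting into the governing equations and linearizing yields the
    % generalized eigenvalue problem
    \mathcal{A} \, \hat{q} = -i \omega \, \mathcal{B} \, \hat{q},
    % where \hat{q}(x, y) is a global mode and \omega = \omega_r + i \omega_i.
    % A mode with \omega_i > 0 grows in time; an anti-symmetric unstable mode of
    % this kind is what can break the symmetry of the trailing-edge flow.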

    Explaining dynamic cache partitioning speed ups

    Cache partitioning has been proposed as an interesting alternative to the traditional eviction policies of shared cache levels in modern CMP architectures: throughput is improved at a reasonable cost. However, these new policies behave differently depending on the applications that are running on the architecture. In this paper, we introduce metrics that characterize applications and allow us to build a clear and simple model that explains the final throughput speed-ups.
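
    The paper's own characterization metrics are not reproduced here; purely as an illustration of the kind of throughput accounting behind the term "speed-up", a sketch using hypothetical per-application IPC values could look like this:

    def throughput_speedup(ipc_partitioned, ipc_shared):
        # Average of per-application IPC ratios: partitioned LLC vs. shared-LRU baseline.
        ratios = [p / s for p, s in zip(ipc_partitioned, ipc_shared)]
        return sum(ratios) / len(ratios)

    ipc_shared      = [1.20, 0.45]   # two co-runners on an LRU-shared LLC (made-up values)
    ipc_partitioned = [1.18, 0.70]   # after giving more ways to the cache-sensitive one
    print(f"throughput speed-up: {throughput_speedup(ipc_partitioned, ipc_shared):.2f}x")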

    DIA: A complexity-effective decoding architecture

    Fast instruction decoding is a true challenge for the design of CISC microprocessors implementing variable-length instructions. A well-known solution to overcome this problem is caching decoded instructions in a hardware buffer. Fetching already decoded instructions avoids the need to decode them again, improving processor performance. However, introducing such special-purpose storage in the processor design involves an important increase in the fetch architecture complexity. In this paper, we propose a novel decoding architecture that reduces the fetch engine implementation cost. Instead of using a special-purpose hardware buffer, our proposal stores frequently decoded instructions in the memory hierarchy. The address where the decoded instructions are stored is kept in the branch prediction mechanism, enabling it to guide our decoding architecture. This makes it possible for the processor front end to fetch already decoded instructions from memory instead of the original non-decoded instructions. Our results show that, using our decoding architecture, a state-of-the-art superscalar processor achieves competitive performance improvements while requiring less chip area and energy consumption in the fetch architecture than a hardware code caching mechanism.
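
    As a rough illustration of the idea (a sketch only, not the paper's exact design; all structures and addresses are hypothetical), the branch-prediction path can carry, for hot regions, the address of an already-decoded copy kept in the ordinary memory hierarchy:

    decoded_copy = {}   # hypothetical map: fetch address -> address of decoded copy

    def record_decoded_copy(fetch_addr, copy_addr):
        # Called once a hot region has been decoded and written back to memory.
        decoded_copy[fetch_addr] = copy_addr

    def next_fetch(predicted_addr):
        # The predictor's target lookup also tells the front end where a decoded copy lives.
        copy_addr = decoded_copy.get(predicted_addr)
        if copy_addr is not None:
            return copy_addr, "pre-decoded"     # bypasses the decode stage
        return predicted_addr, "needs decode"

    record_decoded_copy(0x400000, 0x800000)     # made-up addresses
    print(next_fetch(0x400000))                 # decoded copy found
    print(next_fetch(0x400040))                 # falls back to normal decoding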

    Enabling preemptive multiprogramming on GPUs

    GPUs are being increasingly adopted as compute accelerators in many domains, spanning environments from mobile systems to cloud computing. These systems usually run multiple applications, from one or several users. However, GPUs do not provide the support for resource sharing traditionally expected in these scenarios. Thus, such systems are unable to provide key multiprogrammed workload requirements, such as responsiveness, fairness, or quality of service. In this paper, we propose a set of hardware extensions that allow GPUs to efficiently support multiprogrammed GPU workloads. We argue for preemptive multitasking and design two preemption mechanisms that can be used to implement GPU scheduling policies. We extend the architecture to allow concurrent execution of GPU kernels from different user processes and implement a scheduling policy that dynamically distributes the GPU cores among concurrently running kernels, according to their priorities. We extend an NVIDIA GK110 (Kepler)-like GPU architecture with our proposals and evaluate them on a set of multiprogrammed workloads with up to eight concurrent processes. Our proposals improve the execution time of high-priority processes by 15.6x, the average application turnaround time by 1.5x to 2x, and system fairness by up to 3.4x.
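
    As an illustration of priority-driven core distribution (a generic sketch, not the paper's exact scheduling policy; the kernel names, priorities, and SM count are made up):

    def distribute_sms(total_sms, priorities):
        """priorities: {kernel_name: priority_weight}; returns SMs assigned per kernel."""
        total_w = sum(priorities.values())
        alloc = {k: (total_sms * w) // total_w for k, w in priorities.items()}
        # Hand out SMs lost to integer rounding, highest priority first.
        leftover = total_sms - sum(alloc.values())
        for k in sorted(priorities, key=priorities.get, reverse=True):
            if leftover == 0:
                break
            alloc[k] += 1
            leftover -= 1
        return alloc

    # Hypothetical workload: three processes sharing a 15-SM (GK110-like) GPU.
    print(distribute_sms(15, {"ui_kernel": 8, "batch_a": 1, "batch_b": 1}))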

    Take a walk on the math side

    This paper presents a project carried out by the third-year Group A students at the Institut Flos i Calcat high school in Barcelona during the 2017-2018 school year. With the support of the Mobile History Map application, the students created a mathematical gazetteer of the capital of Catalonia: a list and a map indicating the streets and places of Barcelona whose names are related to mathematics, where clicking on each one opens a game of three questions featuring the city and our science.

    Improving GPU cache hierarchy performance with a fetch and replacement cache

    In the last few years, GPGPU computing has become one of the most popular computing paradigms in high-performance computers due to its excellent performance-to-power ratio. The memory requirements of GPGPU applications widely differ from those of their CPU counterparts: the number of memory accesses is several orders of magnitude higher in GPU applications than in CPU applications, and they present disparate access patterns. Because of this, large and highly associative Last-Level Caches (LLCs) bring much lower performance gains in GPUs than in CPUs. This paper presents a novel approach to manage LLC misses that efficiently improves the LLC hit ratio, memory-level parallelism, and miss latencies in GPU systems. The proposed approach leverages a small additional Fetch and Replacement Cache (FRC) that stores control and coherence information for incoming blocks until they are fetched from main memory. Fetched blocks are then swapped with the victim blocks to be replaced in the LLC, and the eviction of the victim blocks is performed from the FRC afterwards. This management approach improves performance for three main reasons: (i) the lifetime of the blocks being replaced is increased, (ii) the main memory path is unclogged on long bursts of LLC misses, and (iii) the average delay suffered by L2 misses is reduced. Experimental results show that our proposal increases performance (OPC) by over 25% in most of the studied applications, reaching improvements of up to 150% in some applications.
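
    A high-level sketch of the FRC management flow described above (heavily simplified, with assumed structures; not the actual hardware design):

    class FRC:
        """Toy Fetch-and-Replacement Cache tracking in-flight fills and pending victims."""

        def __init__(self):
            self.pending = {}   # addr -> control/coherence info of the in-flight block
            self.victims = []   # victim blocks waiting to be evicted (e.g. written back)

        def on_llc_miss(self, addr, info):
            # The incoming block is tracked here; no LLC victim is evicted yet,
            # so its data keeps serving hits while the fill is in flight.
            self.pending[addr] = info

        def on_fill(self, llc, addr, data, victim_addr):
            # Swap: the fetched block enters the LLC, the victim moves into the FRC.
            victim = llc.pop(victim_addr, None)
            llc[addr] = data
            if victim is not None:
                self.victims.append((victim_addr, victim))
            self.pending.pop(addr, None)

        def drain_victims(self):
            # Evictions complete later, off the critical path of the miss.
            while self.victims:
                yield self.victims.pop(0)

    llc = {0x100: "victim-data"}                 # hypothetical LLC contents
    frc = FRC()
    frc.on_llc_miss(0x200, info={"state": "fetching"})
    frc.on_fill(llc, 0x200, "new-data", victim_addr=0x100)
    print(llc, list(frc.drain_victims()))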

    Variance-based sensitivity analysis in vehicle dynamics simulation: development and application of a polynomial chaos expansion method

    During the early stages of any new product development, many parameters of the product being developed will change and are therefore subject to uncertainty. In particular, in the automotive industry, these uncertainties in the parameters will ultimately affect the behavior of the vehicle being developed. If particular vehicle dynamics are targeted, knowing which specific input parameter of the vehicle is most important in determining the uncertainty in the output of interest is of utmost importance. Valero Andreu, A. (2016). Variance-based sensitivity analysis in vehicle dynamics simulation: development and application of a polynomial chaos expansion method. http://hdl.handle.net/10251/73125 (TFG)
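
    As an illustration of the method named in the title (a minimal regression-based polynomial chaos expansion sketch with a made-up two-parameter model; nothing here is taken from the thesis itself), first-order Sobol sensitivity indices can be read directly off the chaos coefficients:

    import numpy as np
    from numpy.polynomial.hermite_e import hermeval
    from math import factorial

    rng = np.random.default_rng(0)

    def model(x):                     # hypothetical stand-in for the vehicle simulation
        return 2.0 * x[:, 0] + 0.5 * x[:, 1] ** 2 + 0.3 * x[:, 0] * x[:, 1]

    # Multi-indices of a total-degree-2 probabilists' Hermite basis in 2 dimensions
    # (inputs assumed independent standard normal).
    alphas = [(0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (0, 2)]

    def psi(alpha, x):                # product of 1-D Hermite polynomials He_a(x_d)
        cols = [hermeval(x[:, d], [0] * a + [1]) for d, a in enumerate(alpha)]
        return np.prod(cols, axis=0)

    # Least-squares fit of the PCE coefficients on random samples.
    x = rng.standard_normal((500, 2))
    A = np.column_stack([psi(a, x) for a in alphas])
    coeffs, *_ = np.linalg.lstsq(A, model(x), rcond=None)

    # Variance decomposition: each basis term contributes c_a^2 * prod(a_i!).
    norms = np.array([np.prod([factorial(ai) for ai in a]) for a in alphas])
    var_terms = coeffs ** 2 * norms
    total_var = var_terms[1:].sum()                 # skip the mean term
    S1 = var_terms[[1, 3]].sum() / total_var        # terms involving only parameter 1
    S2 = var_terms[[2, 5]].sum() / total_var        # terms involving only parameter 2
    print(f"first-order Sobol indices: S1={S1:.2f}  S2={S2:.2f}")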